A person cannot have a value of zero for Glucose, BloodPressure, SkinThickness, Insulin, BMI or Diabetes Pedigree Function. Zeros in these columns are not plausible measurements, so they are really missing values, and we'll treat them with missing-value imputation techniques.
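As a minimal sketch of that first step, the zeros can be converted to `NaN` so that pandas recognises them as missing. The toy rows and the shortened column list below are illustrative assumptions, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only).
df = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "BMI": [33.6, 26.6, 0.0],
    "Pregnancies": [6, 0, 8],  # zero is a legitimate value here
})

# Columns where a zero is not a plausible measurement (shortened list for the sketch).
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]

# Replace zeros with NaN only in those columns; Pregnancies is left untouched.
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

# Number of missing values per column.
print(df.isna().sum())
```

Restricting the replacement to an explicit column list keeps legitimate zeros, such as zero pregnancies, intact.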

Let's look at the distribution of each feature to decide how to fill the missing values.

Filling Missing Values
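One common way to fill the gaps is per-column median imputation. The notebook does not say which statistic was used, so the median here is an assumption (it is less sensitive to skew than the mean), and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; the values are illustrative only.
df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "Insulin": [np.nan, 94.0, 168.0, np.nan],
})

# Assumption: fill each column with its own median.
# df.median() skips NaN by default, so it is computed from observed values only.
df = df.fillna(df.median())

# Total number of missing values left in the frame.
print(df.isna().sum().sum())
```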

Now our dataset is free of null values, so we can proceed.

Pair Plot after handling missing values

Count of column data types in the dataset

Count of diabetic and healthy people in the dataset
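A short sketch of how the class counts might be obtained. The column name `Outcome` (1 = diabetic, 0 = healthy) is an assumption, and the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical target column: 1 = diabetic, 0 = healthy.
df = pd.DataFrame({"Outcome": [0, 1, 0, 0, 1, 0]})

counts = df["Outcome"].value_counts()               # absolute counts per class
share = df["Outcome"].value_counts(normalize=True)  # class proportions

print(counts)
print(share)
```

Checking the proportions as well as the counts shows how imbalanced the classes are, which matters when accuracy is later used as a metric.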

Heatmap of Original Dataset

Heatmap of Clean Data

From the heatmap above, we see moderate correlation between some pairs of columns:

  - Age and Pregnancies = 0.54
  - Glucose and Insulin = 0.49
  - SkinThickness and BMI = 0.57

Let's create scatter plots for these column pairs to understand the relationships behind the highest correlation values.
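The strongest pairs can also be read off the correlation matrix programmatically rather than by eye. The sketch below uses synthetic data (the `Pregnancies` column is constructed to correlate with `Age`), so the numbers it prints are not the ones quoted above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.normal(35, 10, 200)

# Synthetic stand-in data; "Pregnancies" is built to correlate with "Age".
df = pd.DataFrame({
    "Age": age,
    "Pregnancies": 0.1 * age + rng.normal(0, 1, 200),
    "Noise": rng.normal(0, 1, 200),
})

corr = df.corr()  # Pearson correlation by default

# Keep only the strict upper triangle so each pair appears once,
# then flatten to a Series and rank by correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs.head(3))
```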

Data Split for training and testing
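A minimal sketch of the split using scikit-learn's `train_test_split`. The 80/20 ratio, the stratification and the random seed are assumptions, and `make_classification` stands in for the real features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix X and labels y.
X, y = make_classification(
    n_samples=200, n_features=8, weights=[0.65, 0.35], random_state=0
)

# Assumed settings: 20% held out, stratified so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```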

Standardization

To bring all features onto the same scale, we'll standardize the data.
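As a sketch, standardization with scikit-learn's `StandardScaler` could look like this. Fitting on the training set only and reusing those statistics for the test set is the usual practice; whether the notebook did exactly this is an assumption. The arrays are toy values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays with features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # apply the training statistics to test data

# After scaling, each training column has mean 0 and standard deviation 1.
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Fitting the scaler on the training data alone avoids leaking information from the test set into the model.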

  1. Logistic Regression Model
  1. Random Forest Classifier
  1. Support Vector Machine
  1. K-Nearest Neighbours

From the results above, we can conclude that n=7 gives the best results, so we'll use n_neighbors = 7 for the final model.
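One way to run such a comparison is to score several values of `n_neighbors` and keep the best one. The sketch below uses 5-fold cross-validation on synthetic data, which is an assumption (the notebook may have compared test-set accuracy instead), so the selected value will not necessarily be 7.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

scores = {}
for k in range(1, 16, 2):  # odd k avoids tied votes in binary classification
    # Scaling sits inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```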

  1. Decision Tree Classifier

In a Nutshell: Accuracy, Sensitivity and Specificity
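For reference, all three metrics follow from the confusion-matrix counts: accuracy = (TP + TN) / total, sensitivity = TP / (TP + FN), and specificity = TN / (TN + FP). The helper below is a hypothetical illustration, not the notebook's code, and it assumes both classes appear in `y_true`.

```python
def classification_summary(y_true, y_pred):
    """Return accuracy, sensitivity and specificity for binary labels (1 = positive).

    Hypothetical helper; assumes y_true contains at least one 0 and one 1.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of actual positives that were caught
        "specificity": tn / (tn + fp),  # share of actual negatives that were cleared
    }


# Toy labels: 3 TP, 1 FN, 3 TN, 1 FP.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
summary = classification_summary(y_true, y_pred)
print(summary)
```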

Combined ROC Curve for all the models

The Random Forest Classifier has the largest area under the ROC curve, so by this measure it could be the best algorithm for this problem.
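The area under each curve can be computed directly with `roc_auc_score`, which makes the comparison explicit. This sketch fits only two of the models on synthetic data with assumed hyperparameters, so the ranking it produces says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and an assumed 75/25 stratified split.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Two of the models, with assumed hyperparameters.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # ROC AUC needs scores, so use the predicted probability of the positive class.
    proba = model.predict_proba(X_test)[:, 1]
    auc[name] = roc_auc_score(y_test, proba)

best_model = max(auc, key=auc.get)
print(auc, best_model)
```

Because a single train/test split can favour one model by chance, repeating the comparison with cross-validation gives a more reliable ranking.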